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Biological Sequence Analysis 

T. P. Speed* 



Abstract 

This talk will review a little over a decade's research on applying certain 
stochastic models to biological sequence analysis. The models themselves have 
a longer history, going back over 30 years, although many novel variants have 
arisen since that time. The function of the models in biological sequence 
analysis is to summarize the information concerning what is known as a motif 
or a domain in bioinformatics, and to provide a tool for discovering instances 
of that motif or domain in a separate sequence segment. We will introduce the 
motif models in stages, beginning from very simple, non-stochastic versions, 
progressively becoming more complex, until we reach modern profile HMMs 
for motifs. A second example will come from gene finding using sequence data 
from one or two species, where generalized HMMs or generalized pair HMMs 
have proved to be very effective. 
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1. Introduction 

DNA (deoxyribonucleic acid), RNA (ribonucleic acid), and proteins are macro- 
molecules which are unbranched polymers built up from smaller units. In the case 
of DNA these units are the 4 nucleotide residues A (adenine), C (cytosine), G (gua- 
nine) and T (thymine) while for RNA the units are the 4 nucleotide residues A, C, 
G and U (uracil). For proteins the units are the 20 amino acid residues A (alanine), 
C (cysteine) D (aspartic acid), E (glutamic acid), F (phenylalanine), G (glycine), H 
(histidine), I (isoleucine) , K (lysine), L (leucine), M (methionine), N (asparagine), 
P (proline), Q (glutamine), R (arginine), S (serine), T (threonine), V (valine), W 
(tryptophan) and Y (tyrosine). To a considerable extent, the chemical properties 
of DNA, RNA and protein molecules are encoded in the linear sequence of these 
basic units: their primary structure. 
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The use of statistics to study linear sequences of biomolecular units can be 
descriptive or it can be predictive. A very wide range of statistical techniques has 
been used in this context, and while statistical models can be extremely useful, 
the underlying stochastic mechanisms should never be taken literally. A model or 
method can break down at any time without notice. Further, biological confirmation 
of predictions is almost always necessary. 

The statistics of biological sequences can be global or it can be local. For 
example, we might consider the global base composition of genomes: E. coli has 
25% A, 25% C, 25% G, 25% T, while P. falciparum has 82%A+T. At the very 
local, the triple ATG is the near universal motif indicating the start of translation 
in DNA coding sequence. A major role of statistics in this context is to characterize 
individual sequences or classes of biological sequences using probability models, 
and to make use of these models to identify them against a background of other 
sequences. Needless to say, the models and the tools vary greatly in complexity. 

Extensive use is made in biological sequence analysis of the notions of motif or 
domain in proteins, and site in DNA. We shall use these terms interchangeably to 
describe the recurring elements of interest to us. It is important to note that while 
we focus on the sequence characteristics of motifs, domains or sites, in practice they 
also embody (biochemical) structural significance. 

2. Deterministic models 

The C2H2 (cysteine-cysteine histidine-histidine) zinc-finger DNA binding do- 
main is composed of 25-30 amino acid residues including two conserved cysteines 
and two conserved histidines spaced in a particular way, with some restrictions on 
the residues in between and nearby. Of course the arrangement reflects the three- 
dimensional molecular structure into which the amino-acid sequence folds, for it is 
the structure which has the real biochemical significance, see Figure ^ which was 
obtained from http : / / www . rcsb . org/pdb/ An example of this motif is the 27- 




Figure 1: A C2H2 zinc finger DNA binding domain 
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letter sequence known as IZNF, this being a Protein Data Bank identifier for the 
structure XFIN-31 of X. laevis. Its amino acid sequence is 

IZNF : XYKCGLCERSFVEKSALSRHQRVHKNX 

Note the presence of the two Cs separated by 2 other residues, and the two Hs 
separated by 3 other residues. Here and elsewhere, X denotes an arbitrary amino 
acid residue. A popular and useful summary description of C2H2 zinc fingers which 
clearly includes our example, is the regular expression 

C ~ X{2, 4) - C - X{3) ~ [LIVMFYWC] - X{8) -H - X{3, 5) - H 

where X{m) denotes a sequence of n unspecified amino acids, while X(m, n) denotes 
from m to n such, and the brackets enclose mutually exclusive alternatives. There is 
a richer set of notation for regular expressions of this kind, but for our purposes it is 
enough to note that this representation is essentially deterministic, with uncertainty 
included only through mutually exclusive possibilities (e.g. length or residue) which 
are not otherwise distinguished. 

Simple and efficient algorithms exist for searching query sequences of residues 
to find every instance of the regular expression above. In so doing with sequence 
in which all instances of the motif are known, we may identify some sub-sequences 
of the query sequence which are not C2H2 zinc finger DNA binding domains, i.e. 
which are false positives, and we may miss some sub-sequences which are C2H2 
zinc fingers, i.e. which are false negatives. Thus we have essentially deterministic 
descriptions and search algorithms for the C2H2 motifs using regular expressions. 
Their performance can be described by the frequency of false positives and false 
negatives, equivalently, their complements, the specificity and sensitivity of the 
regular expression. We do not have space for an extensive bibliography, so for more 
on regular expressions and on most of the other concepts we introduce below, see 


3. Regular expressions can be limiting 

Most protein binding sites are characterized by some degree of sequence speci- 
ficity, but seeking a consensus DNA sequence is often an inadequate way to rec- 
ognize their motifs. Simply listing the alternatives seen at a position may not be 
very informative, but keeping track of the frequencies with which the different al- 
ternatives appear can be very valuable. Thus position-specific nucleotide or amino 
acid distributions came to represent the variability in DNA or protein motif com- 
position. This is just the set of marginal distribution of letters at each position. 
Rather than present an extensive tabulation of frequencies for our C2H2 zinc fin- 
ger example, we present a pictorial representation: a sequence logo coming from 
jhttp : //blocks . f here . or^ 

Sequence logos are scaled representation of position-specific nucleotide or amino 
acid distributions. The overall height at a given position is proportional to infor- 
mation content, which is a constant minus the entropy of the distribution at that 
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Figure 2: Sequence logo for C2H2 zinc finger 

position. The proportions of each nucleotide or amino acid at a position are in re- 
lation to their observed frequency at that position, with the most frequent on top, 
the next most frequent below, etc. 

4. Profiles 

It is convenient for our present purposes to define a profile as a set of position- 
specific distributions describing a motif. (Traditionally the term has been used for 
the derived scores.) How would we use a set of such distributions to search a query 
sequence for instances of the motif? The answer from bioinformatics is that we 
score the query sequence, and for suitably large scores, declare that a candidate 
subsequence is an instance of our motif. 

There are a number of approaches for deriving profile scores, but the easiest to 
explain here is this: scores are log-likelihood ratio test statistics, for discriminating 
between a probability model M for the motif and a model B for the background. 
The model M will be the direct product of the position-specific distributions, (i.e. 
the independent but not identical distribution model) , while the background model 
B will be the direct product of a set of relevant background frequencies (i.e. the 
independent and identical distribution model). Thus, if fai is the frequency of 
residue a at position I of the motif, and fa background frequency of the same 
residue, then the profile score assigned to residue a at position I in a possible 
instance of the motif will be Sai = log /a///a- These scores are then summed across 
the positions in the motif, and compared to a suitably defined threshold. Note 
that proper setting of the threshold requires a set of data in which all instances 
of the motif are known. The false positive and false negative rate could then be 
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determined for various thresholds, and a suitable choice made. 

We briefly discuss variants of the log-likelihood ratio scores. In many contexts, 
it will matter little whether a position is occupied by a leucine (L) rather than an 
isoleucine (/), as each can evolve in time to or from the other rather more readily 
than from other residues. Thus it might make sense to modify the scores to take 
this and similar evolutionary patterns into account. Indeed the first use of profiles 
involved scores of this kind, using the position specific amino acid distribution 
of an alignment of instances of the motif and entries from what are known as 
PAM matrices, which embody patterns of molecular evolution. In addition, the 
background distribution of residues may be modelled more detailed manner, e.g. 
using the so-called Dirichlet mixture models. 

It is also possible to include position-specific scores for insertion and deletion 
of residues, relative to a consensus pattern. When these are used, the scoring 
becomes a little more subtle, as the problem is then quite analogous to pairwise 
sequence alignment, but with position dependent scoring parameters for matches, 
mismatches, insertions and deletions. 

We summarise this section by noting that probability has entered into our 
description through the use of frequencies, and scores based on them, but so far we 
do not have global statistical models, at least not ones embodying insertions and 
deletions, on which we base our estimation and testing. These are all part of the 
use of profile HMMs, but first we introduce HMMs. 

5. Hidden Markov models 

Hidden Markov models (HMMs) are processes {St,Ot),t — 1, . . . ,T, where St 
is the hidden state and Ot the observation at time t. Their probabilistic evolution 
is constrained by the equations 



The definitions and basic facts concerning HMMs were laid out in a series of beauti- 
ful papers by L. E. Baum and colleagues around 1970, see for references. Much 
of their formulation has been used almost unchanged to this day. Many variants 
are now used. For example, the distribution of O may not depend on previous 5', 
or it may also depend on previous O values. 



Most importantly for us below, the times of S and O may be decoupled, permit- 
ting the observation corresponding to state time t to be a string whose length and 
composition depends on St (and possibly 5*4-1 and part or all of the previous ob- 
servations). This is called a hidden semi-Markov or generalized hidden Markov 
model. 



pr{St\St-i,Ot-i,St-2,Ot-2, ■ ■ •) 
pr{Ot\St-i,Ot-i, St-2,Ot-2, ■ ■ •) 



pr{St\St^i), 
pr{Ot\St,St-i). 



pr{Ot\St, St-i, Ot-i, • ■ •) 
pr{Ot\St, St^i, Ot-i, • ■ •) 



pr{Ot\St), or 
pr{Ot\St,St-i,Ot-i). 
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Early applications of HMMs were to finance, but these were never published, 
to speech recognition, and to modelling ion channels. In the mid-late 1980s HMMs 
entered genetics and molecular biology, where they are now firmly entrenched. One 
of the major reasons for the success of HMMs as stochastic models is the fact that 
although they are substantial generalizations of Markov chains, there are elegant 
dynamic programming algorithms which permit full likelihood calculations in many 
cases of interest. Specifically, there are algorithms which permit the efficient calcu- 
lation of a) pr{sequence\M) , where sequence is a sequence of observations and M 
is an HMM; b) the maximum over states oi pr{states\sequence, M), where states is 
the unobserved state sequence underlying the observation sequence; and c) the max- 
imum likelihood estimates of parameters in M based on the observation sequence. 
Step c) is carried out by an iterative procedure which in the case of independent 
states was later termed the EM algorithm. 

6. Profile HMMs 

In a landmark paper A. Krogh, D. Haussler and co-workers introduced profile 
HMMs into bioinformatics. An illustrative form of their profile HMM architecture 
is given in Figure 01 There we depict the underlying state space of the hidden 




Figure 3: State space of a simple profile HMM 

Markov chain of a profile HMM of length 4, with M denoting match states, / insert 
states and D delete states, while B and E are begin and end states, respectively. 
Encircled states (D, B and E) do not emit observations, while each of the match 
and insert states will have position-specific observation or emission distributions. 
Finally, each arrow will have associated transition probabilities, with the expecta- 
tion being that the horizontal transition probabilities are typically near unity. This 
the chain proceeds from left to right, and if it remains within match states, its 
output will be an amino acid sequence of length 4. Deviation to the insert or delete 
states will modify the output accordingly. The similarity with a direct product of 
a sequence of position-specific distributions should be unmistakeable. The profile 
HMMs in use now have considerably more features, while sharing the basic M,I 
and D architecture. 



Biological Sequence Analysis 



103 



Why was the introduction of the HMM formalism such an advance? The 
answer is simple: it permitted the construction and application of profiles to be 
conducted entirely within a formal statistical framework, and that really helped. 
Instances of the motif embodied in an HMM could be identified by calculating 
pr{sequence\M) /pr(sequence\B) as was done with profiles, using the algorithm for 
problem a) in X above. Instances of the motif could be aligned to the HMM by 
calculating the most probable state sequence giving rise to the motif sequence, in 
essence finding the most probable sequence of matches, insertions, deletions which 
align the given sequence to the others which gave rise to the HMM, cf. problem b) 
above. And finally, the parameters in the HMMs could be estimated from data com- 
prising known instances of the motif by using maximum likelihood, an important 
step for many reasons, one being that it put insert and delete scores on precisely the 
same footing as match and mismatch scores. Although the estimation of HMM pa- 
rameters is easiest if the example sequences are properly aligned, the EM algorithm 
(problem c) above) does not require aligned sequences. 

In the years since the introduction of profile HMMs, they have been become 
the standard approach to representing motifs and protein domains. The database 
Pfam (http://pfajn.wustl.edu) now has 3,849 hidden Markov models (May 2002) 
representing recognized protein or DNA domains or motifs. Profile HMMs have es- 
sentially replaced the use of regular expressions and the original profiles for searching 
other databases to find novel instances of a motif, for finding a motif or domain 
match to an input sequence, and for aligning a motif or domain to a an existing 
family. There is considerable evidence that the HMM-based searches are more pow- 
erful than the older profile based ones, though they are slower computationally, and 
at times that is an important consideration. 

7. Finding genes in DNA sequence 

Identifying genes in DNA sequence is one of the most challenging, interesting 
and important problems in bioinformatics today. With so many genomes being 
sequenced so rapidly, and the experimental verification of genes lagging far behind, 
it is necessary to rely on computationally derived genes in order to make immediate 
use of the sequence. 

What is a gene? Most readers will have heard of the famous central dogma 
of molecular biology, in which the hereditary material of an organism resides in its 
genome, usually DNA, and where genes are expressed in a two-stage process: first 
DNA is transcribed into a messenger RNA (mRNA) sequence, and later a processed 
form of this sequence is translated into an amino acid sequence, i.e. a protein. In 
general the transcribed sequence is longer than the translated portion: parts called 
introns (intervening sequence) are removed, leaving exons (expressed sequence), of 
which only some are expressed, while the rest remain untranslated. The translated 
sequence comes in triples called codons, beginning and ending with a unique start 
(ATG) and one of three stop (TAA, TAG, TGA) codons. There are also character- 
istic intron-exon boundaries called splice donor and acceptor sites, and a variety of 
other motifs: promoters, transcription start sites, polyA sites, branching sites, and 
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so on. 

All of the foregoing have statistical characterizations, and in principle they 
can all help identify genes in long otherwise unannotated DNA sequence segments. 
To get an idea of the magnitude of the task with the human genome, consider the 
following facts about human gene sequences the coding regions comprise about 
1.5% of the entire genome; the average gene length is about 27,000 bp (base pair); 
the average total coding region is 1,340 bp; and the average intron length is about 
3,300 bp. Further, only about 8% of genes have a single exon. We see that the 
information in human genes is very dispersed along the genome, and that in general 
the parts of primary interest, the coding exons, are a relatively small fraction of the 
gene, on average about 



8. Generalized HMMs for finding genes 

The HMMs which are effective in finding genes are the generalized HMMs 
(GHMMs) described in section [3 above. Space does not permit our giving an ad- 
equate description here, so we simply outline the architecture of Genscan pP one 
of the most widely used human genefinders. States represent the gene features 
we mentioned above: exon, intron, and of course intergenic region, and a variety of 
other features (promotor, untranslated region, polyA site, and so on. Output obser- 
vations embody state-dependent nucleotide composition, dependence, and specific 
signal features (such as stop codons). In a GHMM the state duration needs to be 
modelled, as well as two other important features of genes in DNA: the reading 
frame, which corresponds to the triples along the mRNA sequence which are se- 
quentially translated, and the strand, as DNA is double stranded, and genes can be 
on either strand, i.e. they can point in either direction. These features can be seen 
in Figure 01 which was kindly supplied by Lior Pachter. 

The output of a GHMM genefinder after processing a genomic segment is 
broadly similar to that from a profile HMM after processing an amino acid sequence: 
the most probable state sequence given an observation sequence is a best gene 
annotation of that sequence, and a variety of probabilities can be calculated to 
indicate the support in the observation sequence for various specific gene features. 

9. Comparative sequence analysis using HMMs 

The large number of sequenced genomes now available, and the observation 
that functionally important regions are evolutionarily conserved, has led to efforts to 
incorporate conservation into the models and methods of biological sequence anal- 
ysis. Pair HMMs were introduced in as a way of including alignment problems 
under the HMM framework, and recently |4| they were combined with GHMMs 
(forming GPHMMs) to carry out alignment and genefinding with homologous seg- 
ments of the mouse and human genomes. Use of the program SLAM on the whole 
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Figure 4: Forward half of the Genscan GHMM state space 



mouse genome (http : //bio . m ath . berkeley . edu/slam/inouse/| ) demonstrated the 
value of GPHMMs in this context. 

10. Challenges in biological sequence analysis 

The first challenge is to understand the biology well enough to begin biological 
sequence analysis. This part will frequently involve collaborations with biologists. 
With HMMs, GHMMs and GPHMMs, designing the underlying architecture, and 
carrying out the modelling for the components parts, e.g. for splice sites in genefind- 
ing GHMM is perhaps the next major challenge. Undoubtedly the hardest and most 
important task of all is the implementation: coding up the algorithms and making 
it all work with error-prone and incomplete sequence data. Finally, it is usually a 
real challenge to find good data sets for calibrating and evaluating the algorithms, 
and for carrying out studies of competing algorithms. 

For a recent example of this process, which is a model of its kind, see 0. There 
an HMM is presented for the so-called recognition sites, which involve two DNA 
motifs separated by a variable number of base pairs. In addition to the examples 
mentioned so far, there are many more HMMs in the bioinformatics literature, see 
p. 79 of P] for ones published before 1998. 



11. Closing remarks 
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In this short survey of biological sequence analysis, I have simply touched on 
some of the major ideas. A much more comprehensive treatment of material covered 
here can be found in the book [2|, whose title not coincidentally is the same as 
that of this paper. Many important ideas from biological sequence analysis have 
not been mentioned here, including molecular evolution and phylogenetic inference, 
and the use of stochastic context-free grammars, a form of generalization of HMMs 
suited to the analysis of RNA sequence data. 

At this Congress I have talked (and am now writing) on the research of others, 
in an area in which my own contributions have been negligible. I chose to do so 
upon being honoured by the invitation to speak at this Congress because I believe 
this topic - HMMs - to be one of the great success stories of applying mathematics 
to bioinformatics. In my view it is the one most worthy of a wider mathematical 
audience. I hope that the fact that there are many others better suited than me to 
speak on this topic will not prevent readers from appreciating it and following it up 
through the bibliography. 

I owe what understanding I have of this field to collaborations and discussions 
with a number of people, and I would like to acknowledge them here. Firstly, Tony 
Wirth, Simon Cawley and Mauro Delorenzi, with whom I have worked on HHMMs. 
Next, it has been an honour and pleasure to observe from close by the development 
of SLAM, by Simon Cawley, Lior Pachter and Marina Alexandersson. Finally I'd 
like to thank Xiaoyue Zhou and Ken Simpson for their kind help to me when I was 
preparing my talk and this paper. 
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